A Novel Bayesian Cluster Enumeration Criterion for Unsupervised Learning

Authors

  • Freweyni K. Teklehaymanot
  • Michael Muma
  • Abdelhak M. Zoubir
Abstract

The Bayesian Information Criterion (BIC) has been widely used for decades to estimate the number of clusters in an observed data set. The original derivation, referred to as classic BIC, does not include information about the specific model selection problem at hand, which renders it generic. However, very little effort has been made to check its appropriateness for cluster analysis. In this paper we derive BIC from first principles by formulating the problem of estimating the number of clusters in a data set as maximization of the posterior probability of candidate models given the observations. We provide a general BIC expression which is independent of the data distribution, provided that some mild assumptions are satisfied. This serves as an important milestone when deriving BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating the clustering problem during the derivation of BIC results in an expression whose penalty term differs from the penalty term of the classic BIC. We propose a two-step cluster enumeration algorithm that uses a model-based unsupervised learning algorithm to partition the observed data according to each candidate model, and the proposed BIC to select the model with the optimal number of clusters. The performance of the proposed criterion is tested using synthetic and real-data examples. Simulation results show that our proposed criterion outperforms the existing BIC-based cluster enumeration methods. Our proposed criterion is particularly powerful in estimating the number of data clusters when the observations have unbalanced and overlapping clusters.
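
The two-step structure described in the abstract (partition the data under each candidate model, then score each partition with a criterion) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the helper names `kmeans_1d`, `classic_bic`, and `enumerate_clusters` are hypothetical, a simple 1-D k-means stands in for the model-based partitioning step, and the score is the classic BIC with hard assignments, not the criterion proposed by the authors (whose penalty term the abstract says is different).

```python
import math

def kmeans_1d(data, k, iters=100):
    """Step 1 stand-in (assumption): partition 1-D data with Lloyd's algorithm."""
    srt = sorted(data)
    # Deterministic initialisation at evenly spaced quantiles of the data.
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    labels = [0] * len(data)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in data]
        new_centers = []
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def classic_bic(data, labels, k):
    """Step 2 stand-in (assumption): classic BIC, log-likelihood minus penalty."""
    n_total = len(data)
    loglik = 0.0
    for j in range(k):
        members = [x for x, l in zip(data, labels) if l == j]
        if not members:
            continue
        n = len(members)
        mean = sum(members) / n
        sq = sum((x - mean) ** 2 for x in members)
        var = max(sq / n, 1e-6)  # floor guards against degenerate clusters
        loglik += n * math.log(n / n_total)              # mixing weight
        loglik -= 0.5 * n * math.log(2 * math.pi * var)  # Gaussian normaliser
        loglik -= sq / (2 * var)                         # quadratic term
    q = 3 * k - 1  # k means + k variances + (k - 1) free weights
    return loglik - 0.5 * q * math.log(n_total)

def enumerate_clusters(data, candidates):
    """Score every candidate number of clusters and keep the best one."""
    scores = {}
    for k in candidates:
        labels, _ = kmeans_1d(data, k)
        scores[k] = classic_bic(data, labels, k)
    return max(scores, key=scores.get), scores

# Hand-made toy data: three well-separated groups of ten points each.
offsets = [-1.5, -1.0, -0.5, -0.5, 0.0, 0.0, 0.5, 0.5, 1.0, 1.5]
data = [c + o for c in (0.0, 10.0, 20.0) for o in offsets]
best_k, scores = enumerate_clusters(data, range(1, 6))
print(best_k)
```

Swapping `classic_bic` for another criterion leaves the outer loop unchanged, which is the point of the two-step design: the partitioning method and the selection criterion are independent components.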

Similar resources

Bayesian Model-Averaging in Unsupervised Learning From Microarray Data

Unsupervised identification of patterns in microarray data has been a productive approach to uncovering relationships between genes and the biological process in which they are involved. Traditional model-based clustering approaches as well as some recently developed model-based mining approaches for integrating genomic and functional genomic data rely on one’s ability to determine the correct ...

Unsupervised training of Bayesian networks for data clustering

This paper presents a new approach to the unsupervised training of Bayesian network classifiers. Three models have been analysed: the Chow and Liu (CL) multinets; the tree-augmented naive Bayes; and a new model called the simple Bayesian network classifier, which is more robust in its structure learning. To perform the unsupervised training of these models, the classification maximum likelihood ...

Refining A Divisive Partitioning Algorithm for Unsupervised Clustering

The Principal Direction Divisive Partitioning (PDDP) algorithm is a fast and scalable clustering algorithm [3]. The basic idea is to recursively split the data set into sub-clusters based on principal direction vectors. However, the PDDP algorithm can yield poor results, especially when cluster structures are not well-separated from one another. Its stopping criterion is based on a heuristic th...

PAC-Bayesian Generalization Bound for Density Estimation with Application to Co-clustering

We derive a PAC-Bayesian generalization bound for density estimation. Similar to the PAC-Bayesian generalization bound for classification, the result has the appealingly simple form of a tradeoff between empirical performance and the KL-divergence of the posterior from the prior. Moreover, the PAC-Bayesian generalization bound for classification can be derived as a special case of the bound for ...

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Journal title:
  • CoRR

Volume abs/1710.07954

Publication year 2017